import pandas as pd
import numpy as np
import dalex as dx
import shap
import pickle
np.random.seed(42)  # seed the global RNG; assigning to np.random.seed would overwrite the function
# Load the train/test splits and drop the index column pandas saved with the CSVs
X_train = pd.read_csv("./src/X_train.csv").drop("Unnamed: 0", axis=1)
y_train = pd.read_csv("./src/y_train.csv").drop("Unnamed: 0", axis=1)
X_test = pd.read_csv("./src/X_test.csv").drop("Unnamed: 0", axis=1)
y_test = pd.read_csv("./src/y_test.csv").drop("Unnamed: 0", axis=1)
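The same cleanup can be done at read time; a small sketch, assuming "Unnamed: 0" is just the row index pandas wrote when the splits were saved:

# Equivalent one-step load: consume the saved row index at read time
# instead of dropping the column afterwards.
X_train = pd.read_csv("./src/X_train.csv", index_col=0)
y_train = pd.read_csv("./src/y_train.csv", index_col=0)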
# Load the four tuned models (saved as fitted GridSearchCV / RandomizedSearchCV objects)
with open("./src/gbm.pickle", "rb") as f:  # XGBoost
    gbm = pickle.load(f)
with open("./src/gbc.pickle", "rb") as f:  # Gradient Boosting
    gbc = pickle.load(f)
with open("./src/rfc.pickle", "rb") as f:  # Random Forest
    rfc = pickle.load(f)
with open("./src/svm.pickle", "rb") as f:  # SVM
    svm = pickle.load(f)
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
# Hard-label predictions on the test set
gbmr = gbm.predict(X_test)
gbcr = gbc.predict(X_test)
rfcr = rfc.predict(X_test)
svmr = svm.predict(X_test)
preds = {
    "XGBoost": gbmr,
    "Gradient Boosting": gbcr,
    "Random Forest": rfcr,
    "SVM": svmr,
}
results = {
    "algorithm": list(preds),
    "accuracy":  [accuracy_score(y_test, p) for p in preds.values()],
    "precision": [precision_score(y_test, p) for p in preds.values()],
    "recall":    [recall_score(y_test, p) for p in preds.values()],
    "ROC AUC":   [roc_auc_score(y_test, p) for p in preds.values()],
}
pd.DataFrame(results)
| | algorithm | accuracy | precision | recall | ROC AUC |
|---|---|---|---|---|---|
| 0 | XGBoost | 0.8050 | 0.806306 | 0.836449 | 0.802633 |
| 1 | Gradient Boosting | 0.8075 | 0.818605 | 0.822430 | 0.806376 |
| 2 | Random Forest | 0.7850 | 0.801887 | 0.794393 | 0.784293 |
| 3 | SVM | 0.7475 | 0.798942 | 0.705607 | 0.750653 |
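Note that the ROC AUC above is computed from hard 0/1 predictions; with class probabilities it would typically be higher. A hedged sketch (assumes the pickled tree-based models expose predict_proba; the SVM is skipped, since its explainer below falls back to yhat_default, suggesting it was trained without probability=True):

# Sketch: AUC from class-1 probabilities instead of hard labels.
for name, model in [("XGBoost", gbm), ("Gradient Boosting", gbc), ("Random Forest", rfc)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC AUC = {roc_auc_score(y_test, proba):.4f}")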
explainer_gbc = dx.Explainer(gbc, X_train, y_train,label="Gradient Boosting")
explainer_gbm = dx.Explainer(gbm, X_train, y_train,label="XGBoost")
explainer_svm = dx.Explainer(svm, X_train, y_train,label="SVM")
explainer_rfc = dx.Explainer(rfc, X_train, y_train,label="Random Forest")
Preparation of a new explainer is initiated
  -> data              : 1199 rows 11 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 1199 values
  -> model_class       : sklearn.model_selection._search.GridSearchCV (default)
  -> label             : Gradient Boosting
  -> predict function  : <function yhat_proba_default at 0x0000014FA47A79D0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00902, mean = 0.535, max = 0.993
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.486, mean = 7.72e-06, max = 0.485
  -> model_info        : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
  -> data              : 1199 rows 11 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 1199 values
  -> model_class       : sklearn.model_selection._search.RandomizedSearchCV (default)
  -> label             : XGBoost
  -> predict function  : <function yhat_proba_default at 0x0000014FA47A79D0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0178, mean = 0.534, max = 0.992
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.498, mean = 0.000165, max = 0.418
  -> model_info        : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
  -> data              : 1199 rows 11 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 1199 values
  -> model_class       : sklearn.model_selection._search.GridSearchCV (default)
  -> label             : SVM
  -> predict function  : <function yhat_default at 0x0000014FA47A7940> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.505, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.0, mean = 0.03, max = 1.0
  -> model_info        : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
  -> data              : 1199 rows 11 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 1199 values
  -> model_class       : sklearn.model_selection._search.RandomizedSearchCV (default)
  -> label             : Random Forest
  -> predict function  : <function yhat_proba_default at 0x0000014FA47A79D0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0106, mean = 0.535, max = 0.999
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.56, mean = -0.000809, max = 0.529
  -> model_info        : package sklearn
A new explainer has been created!
fgbm = explainer_gbm.model_parts()
fgbm.result
| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | _full_model_ | 0.000000 | XGBoost |
| 1 | free sulfur dioxide | 0.000580 | XGBoost |
| 2 | fixed acidity | 0.000768 | XGBoost |
| 3 | residual sugar | 0.000985 | XGBoost |
| 4 | pH | 0.001235 | XGBoost |
| 5 | density | 0.001530 | XGBoost |
| 6 | citric acid | 0.001705 | XGBoost |
| 7 | chlorides | 0.005214 | XGBoost |
| 8 | volatile acidity | 0.008567 | XGBoost |
| 9 | total sulfur dioxide | 0.012446 | XGBoost |
| 10 | sulphates | 0.048579 | XGBoost |
| 11 | alcohol | 0.058437 | XGBoost |
| 12 | _baseline_ | 0.510284 | XGBoost |
fgbm.loss_function
<function dalex.model_explanations._variable_importance.loss_functions.loss_one_minus_auc(observed, predicted)>
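The default dropout loss for classification is thus 1 − AUC: the loss of the model after a variable is permuted, so larger values mean the variable mattered more. A minimal sketch of passing it explicitly via model_parts' loss_function parameter (fgbm_explicit is a hypothetical name; this should reproduce the default behaviour):

from sklearn.metrics import roc_auc_score

# The default classification loss spelled out: 1 - AUC after permutation.
def one_minus_auc(observed, predicted):
    return 1 - roc_auc_score(observed, predicted)

fgbm_explicit = explainer_gbm.model_parts(loss_function=one_minus_auc)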
fgbm.plot()
Key observations:
# SHAP feature importances for the tuned XGBoost estimator inside the search object
gbm_model = gbm.best_estimator_
shap.summary_plot(
    shap.TreeExplainer(gbm_model).shap_values(X_train),
    X_train,
    plot_type="bar",
)
The ordering of the most important variables is almost identical in both methods, except for total sulfur dioxide and volatile acidity. In the averaged SHAP values, alcohol and sulphates also stand out less from the rest of the variables.
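To compare the two rankings numerically rather than by eye, a small sketch reusing gbm_model from above (assumes shap_values returns a single array for this binary model, as in the bar plot):

# Mean |SHAP| per feature, sorted, for direct comparison with dropout_loss.
shap_values = shap.TreeExplainer(gbm_model).shap_values(X_train)
pd.Series(np.abs(shap_values).mean(axis=0), index=X_train.columns).sort_values(ascending=False)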
frf = explainer_rfc.model_parts()
fsvm = explainer_svm.model_parts()
fgbc = explainer_gbc.model_parts()
fgbm.plot([fsvm,frf,fgbc])
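The same comparison can be read as a table; a sketch collecting the four .result frames (each is a DataFrame with variable / dropout_loss / label columns, as shown earlier):

# One tidy table of permutation importances across all four models.
importances = pd.concat([f.result for f in (fgbm, fsvm, frf, fgbc)], ignore_index=True)
importances.pivot(index="variable", columns="label", values="dropout_loss")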
Conclusions:
from sklearn.base import clone

# Refit the SVM on z-score standardized features; clone() gives a fresh,
# unfitted copy of the search object so the original model is not overwritten
# (svm_2 = svm would only alias it).
X_stand = (X_train - X_train.mean()) / X_train.std()
svm_2 = clone(svm)
svm_2.fit(X_stand, y_train.values.ravel())  # ravel() avoids the column-vector warning
explainer_svm2 = dx.Explainer(svm_2, X_stand, y_train, label="Standardized SVM")
explainer_svm2.model_parts().plot()
Preparation of a new explainer is initiated
  -> data              : 1199 rows 11 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 1199 values
  -> model_class       : sklearn.model_selection._search.GridSearchCV (default)
  -> label             : Standardized SVM
  -> predict function  : <function yhat_default at 0x000001DF76502A60> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.53, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.0, mean = 0.005, max = 1.0
  -> model_info        : package sklearn
A new explainer has been created!
Standardization barely changed the ordering of the most important variables, but it evened out the values of the less important ones.
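A leakage-safe alternative is to standardize inside a Pipeline, so the scaler is fit on training folds only; a sketch with placeholder SVC hyperparameters (not the tuned ones inside the pickled search):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundling the scaler with the estimator keeps test statistics out of training.
svm_pipe = make_pipeline(StandardScaler(), SVC())
svm_pipe.fit(X_train, y_train.values.ravel())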
pdp_gbm = explainer_gbm.model_profile()
pdp_gbm.plot()
pdp_gbc = explainer_gbc.model_profile()
pdp_rfc = explainer_rfc.model_profile()
pdp_svm = explainer_svm.model_profile()
pdp_gbm.plot([ pdp_rfc, pdp_svm,pdp_gbc])
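model_profile also accepts a variables list, so the profiles can be narrowed to the dominant features found earlier; a small sketch (pdp_top is a hypothetical name):

# Partial dependence restricted to the two strongest variables.
pdp_top = explainer_gbm.model_profile(variables=["alcohol", "sulphates"])
pdp_top.plot()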
Conclusions:
ale_gbm = explainer_gbm.model_profile(type = 'accumulated')
ale_gbm.result['_label_'] = "ALE XGBoost"
pdp_gbm.result['_label_'] = "PDP XGBoost"
ale_gbm.plot(pdp_gbm)
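Where PDP and ALE diverge is a hint of correlated features, since ALE accumulates only local effects. A sketch quantifying the gap per variable (assumes both profiles share the same grid; _vname_, _x_ and _yhat_ are the column names dalex uses in .result):

# Join the PDP and ALE curves and rank variables by their largest disagreement.
merged = pdp_gbm.result.merge(ale_gbm.result, on=["_vname_", "_x_"], suffixes=("_pdp", "_ale"))
(merged.assign(gap=(merged["_yhat__pdp"] - merged["_yhat__ale"]).abs())
       .groupby("_vname_")["gap"].max()
       .sort_values(ascending=False))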
ale_gbc = explainer_gbc.model_profile(type = 'accumulated')
ale_rfc = explainer_rfc.model_profile(type = 'accumulated')
ale_svm = explainer_svm.model_profile(type = 'accumulated')
ale_rfc.result['_label_'] = "ALE Random Forest"
ale_gbc.result['_label_'] = "ALE Gradient Boosting"
ale_svm.result['_label_'] = "ALE SVM"
ale_gbm.plot([ale_rfc,ale_svm,ale_gbc])
Key observations: